Abstract: Model selection in clustering requires (i) specifying
a suitable clustering principle and (ii) controlling the model
order complexity by choosing an appropriate number of clusters
depending on the noise level in the data. We advocate
an information theoretic perspective where the uncertainty in 
the measurements quantizes the set of data partitionings and, 
thereby, induces uncertainty in the solution space of clusterings. A 
clustering model, which can tolerate a higher level of fluctuations 
in the measurements than alternative models, is considered to 
be superior provided that the clustering solution is equally 
informative. This tradeoff between informativeness and robustness 
is used as a model selection criterion. The requirement that data 
partitionings should generalize from one data set to an equally 
probable second data set gives rise to a new notion of structure 
induced information. 

I. Introduction 

Data clustering or data partitioning has emerged as the 
workhorse of exploratory data analysis. This unsupervised 
learning methodology comprises a set of data analysis tech- 
niques which group data into clusters by either optimizing 
a quality criterion or by directly employing a clustering 
algorithm. The zoo of models ranges from centroid based
algorithms like k-means or k-medoids, spectral graph methods
like Normalized Cut, Average Cut or Pairwise
Clustering to linkage inspired grouping principles like
Single Linkage, Average Linkage or Path-based
Clustering.

The various clustering methods and algorithms ask for a
unifying meta-principle for choosing the "right" clustering
method depending on the data source. This paper advocates
a shift of viewpoint away from the problem "What is the
'right' clustering model?" to the question "How can we
algorithmically validate clustering models?". This conceptual
shift is rooted in the assumption that ultimately, the data should
vote for their preferred model type and model complexity [4].
Therefore, algorithms which are endowed with the ability 
to validate clustering concepts can maneuver through the 
space of clustering models and, dependent on the training and 
validation data sets, they can select a model with maximal 
information content and optimal robustness. 

In this paper, we propose an information theoretic model 
validation strategy to select clustering models. A clustering 
model is used to generate a code for communication over a 
noisy channel. "Good" models are selected according to their 
robustness to noise. The approximation precision of clustering 
solutions is controlled by an algorithm called empirical risk
approximation (ERA) [2] which quantizes the hypothesis class
of clusterings. ERA employs a hypothetical communication
framework where sets of approximate clustering solutions for
the training and for the test data are used as a communication
code. Approximating the empirical minimizer with
model averaging over approximate solutions favors stability
of clusterings. Furthermore, it is well known that stability
based model selection [8] yields highly satisfactory results in
applications although the theoretical foundation of this model 
selection strategy is still controversially debated [1].

II. Statistical learning of clustering 

Given are a set of objects $O = \{o_1, \dots, o_n\} \in \mathcal{O}$
and measurements $X \in \mathcal{X}$ to characterize these objects.
$\mathcal{O}, \mathcal{X}$ denote the object or measurement space, respectively.
Such measurements might be d-dimensional vectors $X =
\{X_i \in \mathbb{R}^d, 1 \le i \le n\}$ or relations $D = (D_{ij}) \in \mathbb{R}^{n \times n}$
which describe the (dis)-similarity between objects $o_i$ and $o_j$.
More complicated data structures than vectors or relations, 
e.g., three-way data or graphs, are used in various applica- 
tions. In the following, we use the generic notation X for 
measurements. We have to distinguish between objects and 
measurements since repeated measurements might refer to the 
same object. Data denote object-measurement relations $\mathcal{O} \times \mathcal{X}$,
e.g., vectorial data $\{X_i : 1 \le i \le n\}$ describe surjective
relations between objects $o_i$ and measurements $X_i := X(o_i)$.

The hypotheses for a clustering problem are the functions
assigning data to groups, i.e.,

$c : \mathcal{O} \times \mathcal{X} \to \{1, \dots, k\}^n$

$(O, X) \mapsto c(O, X)$ (1)

The parameter $n = |O|$ denotes the number of objects. In
cases where $X$ uniquely identifies the object set $O$, i.e., there
exists a bijective function between objects and measurements,
we omit the first argument of $c$ to simplify notation. A
clustering is then denoted by $c : \mathcal{X} \to \{1, \dots, k\}^n$.

The hypothesis class for a clustering problem is defined as
the set of functions assigning data to groups, i.e., $\mathcal{C}(X) =
\{c(O, X) : O \in \mathcal{O}\}$. For $n$ objects we can distinguish
$O(k^n)$ such functions. Specific clustering models might
require additional parameters $\theta$ which characterize a cluster, e.g.,
the centroids in k-means clustering. The hypothesis class is
then the product space of possible assignments and possible
parameter values.



III. Clustering costs and empirical risk approximation

Exploratory pattern analysis and model selection for grouping
require assessing the quality of clustering hypotheses.
Various criteria emphasize coherency of data or connectedness,
e.g., k-means clustering measures the average distance of data
vectors to the nearest cluster centroid or prototype. For the
subsequent discussion on information theoretic model validation,
a cost or risk function $R(c, X)$ is assumed to measure how
well a particular clustering with assignments $c(X)$ and cluster
parameters $\theta$ groups the objects. To simplify the notation,
cluster parameters $\theta$ are not explicitly listed as arguments of
clustering costs but are subsumed in the specification of the
cost function $R$. A suitable metric for the space of hypotheses
might be chosen based on such a cost function $R$.

The clustering solution $c^\perp(X)$ minimizes the empirical risk
(ERM) of data clustering given the measurements $X$, i.e.,

$c^\perp(X) = \arg\min_c R(c, X)$. (2)



Clustering solutions which are similar in costs to the ERM
solution $c^\perp(X)$ define the set $\mathcal{C}_\gamma(X)$ of empirical risk
approximations for clustering, i.e.,

$\mathcal{C}_\gamma(X) := \{c(X) : R(c, X) \le R(c^\perp, X) + \gamma\}$. (3)

The set $\mathcal{C}_\gamma(X)$ reduces to the ERM solution in the limit
$\lim_{\gamma \to 0} \mathcal{C}_\gamma(X) = \{c^\perp(X)\}$.
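The definition (3) can be made concrete on a toy problem. The following is a minimal sketch, assuming a k-means style cost (sum of squared distances to cluster means) and exhaustive enumeration of all $k^n$ assignments, which is only feasible for very small $n$. The function names `kmeans_cost` and `approximation_set` and the sample data are illustrative assumptions, not part of the paper.

```python
import itertools
import numpy as np

def kmeans_cost(labels, X, k):
    """Sum of squared distances to cluster means; empty clusters add 0."""
    labels = np.asarray(labels)
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts) > 0:
            total += float(((pts - pts.mean(axis=0)) ** 2).sum())
    return total

def approximation_set(X, k, gamma):
    """Return the ERM assignment and all assignments within gamma of its cost."""
    costs = {c: kmeans_cost(c, X, k)
             for c in itertools.product(range(k), repeat=len(X))}
    erm = min(costs, key=costs.get)
    r_min = costs[erm]
    # Small tolerance so that cost ties are not lost to rounding.
    approx = {c for c, r in costs.items() if r <= r_min + gamma + 1e-12}
    return erm, approx

# Hypothetical one-dimensional data with two loosely separated groups.
X = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
erm, tight = approximation_set(X, k=2, gamma=0.0)
_, loose = approximation_set(X, k=2, gamma=0.5)
```

For $\gamma = 0$ the set contains only the ERM partition (once for each naming of the two clusters); increasing $\gamma$ admits neighbouring partitions, which is the coarsening of the hypothesis class described in the text.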

To validate clustering methods we have to define and
estimate the generalization performance of partitionings. We
adopt the two sample set scenario with training and test data
which is widely used in statistics and statistical learning theory
[11], i.e., to bound the deviation of empirical risk from expected
risk, but also for two-terminal systems in information theory
H. We assume for the subsequent discussion that training data 
and test data are described by respective object sets $O^{(1)}, O^{(2)}$
and measurements $X^{(1)}, X^{(2)} \sim P(X)$ which are drawn i.i.d.
from the same probability distribution $P(X)$. Furthermore,
$X^{(1)}, X^{(2)}$ uniquely identify the training and test object sets
$O^{(1)}, O^{(2)}$ so that it is sufficient to list $X^{(j)}$ as references to
object sets $O^{(j)}, j = 1, 2$.

Statistical inference requires that clustering solutions have
to generalize from training data to test data since noise in
the data renders the ERM solution $c^\perp(X^{(1)}) \ne c^\perp(X^{(2)})$
unstable. How can we evaluate the generalization properties
of clustering solutions? Before we can evaluate the clustering
costs $R(\cdot, X^{(2)})$ on test data of the ERM clustering on training
data $c^\perp(X^{(1)})$ we have to identify a clustering $c \in \mathcal{C}(X^{(2)})$
which corresponds to $c^\perp(X^{(1)})$. A priori, it is not clear how
to compare clusterings $c(X^{(1)})$ for measurements $X^{(1)}$ with
clusterings $c(X^{(2)})$ for measurements $X^{(2)}$. Therefore, we
define the mapping



$\psi : \mathcal{C}(X^{(1)}) \to \mathcal{C}(X^{(2)}), \quad c(X^{(1)}) \mapsto \psi \circ c(X^{(1)})$ (4)

which identifies a clustering hypothesis for training data
$c \in \mathcal{C}(X^{(1)})$ with a clustering hypothesis for test data
$\psi \circ c \in \mathcal{C}(X^{(2)})$. The reader should note that such a mapping
$\psi$ might change the object indices. In cases when the
measurements are elements of an underlying metric space, a
natural choice for $\psi$ is the nearest neighbor mapping $\psi(i) =
\arg\min_\ell \|X^{(1)}_\ell - X^{(2)}_i\|$ where we identify clustering $c(X^{(1)})$
with $\psi \circ c(X^{(1)}) = (c(X^{(1)}_{\psi(1)}), c(X^{(1)}_{\psi(2)}), \dots, c(X^{(1)}_{\psi(n)}))$.

Fig. 1. Generation of a set of $2^{nR}$ code problems for communication by
e.g. permuting the object indices.

The mapping $\psi$ enables us to evaluate clustering costs on
test data $X^{(2)}$ for clusterings $c(X^{(1)})$ selected on the basis
of training data $X^{(1)}$. Consequently, we can determine how
many $\gamma$-optimal training solutions are also $\gamma$-optimal on test
data, i.e., $\Delta\mathcal{C}(X^{(1)}, X^{(2)}) := |(\psi \circ \mathcal{C}_\gamma(X^{(1)})) \cap \mathcal{C}_\gamma(X^{(2)})|$.
A large overlap means that the training approximation set
generalizes to the test data, whereas a small or empty
intersection indicates the lack of generalization. Essentially, $\gamma$
parametrizes a coarsening of the hypothesis class such that sets
of data partitionings become stable w.r.t. measurement
fluctuations. The tradeoff between stability and informativeness
is controlled by minimizing $\gamma$ under the constraint of large
$\Delta\mathcal{C}(X^{(1)}, X^{(2)}) / |\mathcal{C}_\gamma(X^{(2)})|$ for given risk function $R(\cdot, X)$.
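The overlap criterion can be illustrated by brute force on a small example. The sketch below assumes a k-means style cost, a nearest neighbor mapping $\psi$ from each test point to its closest training point, and exhaustive enumeration of assignments; the helper names and the data are hypothetical and chosen so that the ERM partitions of the two samples differ.

```python
import itertools
import numpy as np

def cost(labels, X, k):
    """k-means style cost: squared distances to cluster means."""
    labels = np.asarray(labels)
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts) > 0:
            total += float(((pts - pts.mean(axis=0)) ** 2).sum())
    return total

def approx_set(X, k, gamma):
    """All assignments whose cost is within gamma of the minimum."""
    costs = {c: cost(c, X, k)
             for c in itertools.product(range(k), repeat=len(X))}
    r_min = min(costs.values())
    return {c for c, r in costs.items() if r <= r_min + gamma + 1e-12}

def nearest_neighbor_map(X1, X2):
    """psi[i] = index of the training point closest to test point i."""
    d = np.linalg.norm(X2[:, None, :] - X1[None, :, :], axis=2)
    return d.argmin(axis=1)

def overlap_ratio(X1, X2, k, gamma):
    """|psi o C_gamma(X1) intersected with C_gamma(X2)| / |C_gamma(X2)|."""
    psi = nearest_neighbor_map(X1, X2)
    transferred = {tuple(c[i] for i in psi) for c in approx_set(X1, k, gamma)}
    test_set = approx_set(X2, k, gamma)
    return len(transferred & test_set) / len(test_set)

# Hypothetical training and test samples; one test point is shifted so that
# the empirical minimizers on the two samples disagree.
X1 = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
X2 = np.array([[0.0], [0.2], [0.69], [1.0], [1.2], [1.4]])
ratio_erm = overlap_ratio(X1, X2, k=2, gamma=0.0)
ratio_coarse = overlap_ratio(X1, X2, k=2, gamma=0.5)
```

In this example the ERM solution alone does not transfer to the test sample, while a coarser approximation set does, which mirrors the tradeoff between small $\gamma$ and a large overlap ratio.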

IV. Coding by Approximation 

In the following, we describe a communication scenario
with a sender $\mathfrak{S}$, a receiver $\mathfrak{R}$ and a problem generator
$\mathfrak{PG}$ where the problem generator serves as a noisy channel
between sender and receiver. Communication takes place
by approximately optimizing clustering cost functions, i.e.,
by calculating approximation sets $\mathcal{C}_\gamma(X^{(1)}), \mathcal{C}_\gamma(X^{(2)})$. This
coding concept will be referred to as approximation set coding
(ASC). The noisy channel is characterized by a clustering cost
function $R(c, X)$ which determines the channel capacity of the
ASC scenario. Validation and selection of clustering models is
then achieved by maximizing the channel capacity over a set
of cost functions $R_\theta(\cdot, X), \theta \in \Theta$ where $\theta$ indexes the various
clustering models.

Sender $\mathfrak{S}$ and receiver $\mathfrak{R}$ agree on a clustering principle
$R(c, X^{(1)})$ and on a mapping function $\psi$. The following
procedure is then employed to generate the code for the
communication process:

1) Sender $\mathfrak{S}$ and receiver $\mathfrak{R}$ obtain a data set $X^{(1)}$ from
the problem generator $\mathfrak{PG}$.

2) $\mathfrak{S}$ and $\mathfrak{R}$ calculate the $\gamma$-approximation set $\mathcal{C}_\gamma(X^{(1)})$.

3) $\mathfrak{S}$ generates a set of (random) permutations $\Sigma :=
\{\sigma_1, \dots, \sigma_{2^{nR}}\}$ to rename the objects. The permutations
define a set of optimization problems $R(c, \sigma_j \circ X^{(1)})$
with associated approximation sets $\mathcal{C}_\gamma(\sigma_j \circ X^{(1)}), 1 \le j \le 2^{nR}$.

4) $\mathfrak{S}$ sends the set of permutations $\Sigma$ to $\mathfrak{R}$ who determines
the approximation sets $\{\mathcal{C}_\gamma(\sigma_j \circ X^{(1)})\}_{j=1}^{2^{nR}}$.

The rationale behind this procedure is the following: Given
the measurements $X^{(1)}$ the sender has randomly covered the
set of clusterings $\mathcal{C}(X^{(1)})$ by respective approximation sets
$\{\mathcal{C}_\gamma(\sigma_j \circ X^{(1)}) : 1 \le j \le 2^{nR}\}$. Communication succeeds
if the approximation sets are stable under the stochastic
fluctuations of the measurements. The criterion for reliable
communication is defined by the ability of the receiver to
identify a specific permutation that has been selected by the
sender. The approximation sets $\mathcal{C}_\gamma(\sigma_j \circ X^{(1)})$ play the role of
codebook vectors in Shannon's theory of communication.

After this setup procedure, both sender and receiver have
a list of approximation sets available or can algorithmically
determine membership of clusterings in one of the $2^{nR}$
approximation sets.

How is the communication between sender and receiver
organized? During communication, the following steps take
place as depicted in Fig. 2:

1) The sender $\mathfrak{S}$ selects a permutation $\sigma_s$ as message and
sends it to the problem generator $\mathfrak{PG}$.

2) $\mathfrak{PG}$ generates a new data set $X^{(2)}$ and applies the
selected permutation $\sigma_s$ to $X^{(2)}$, yielding $\tilde{X} = \sigma_s \circ X^{(2)}$.

3) $\mathfrak{PG}$ sends $\tilde{X}$ to the receiver $\mathfrak{R}$ without revealing $\sigma_s$.

4) $\mathfrak{R}$ calculates the approximation set $\mathcal{C}_\gamma(\tilde{X})$.

5) $\mathfrak{R}$ estimates the applied permutation $\sigma_s$ by using the
decoding rule

$\hat{\sigma} = \arg\max_{\sigma \in \Sigma} |(\psi \circ \mathcal{C}_\gamma(\sigma \circ X^{(1)})) \cap \mathcal{C}_\gamma(\tilde{X})|$. (5)

Fig. 2. Communication process: (1) the sender selects transformation $\sigma_s$,
(2) the problem generator draws $X^{(2)} \sim P(X)$ and applies $\sigma_s$ to it, and the
receiver estimates $\hat{\sigma}$ based on $\tilde{X} = \sigma_s \circ X^{(2)}$.

This communication channel supports communicating at
most $n \log k$ nats if two conditions hold: (i) the channel is
noise free, $X^{(1)} = X^{(2)}$; (ii) all clusters have the same number
of objects assigned to them.

It is worth mentioning that ASC is conceptually not
restricted to clustering problems although we focus the
discussion here on this problem domain.

V. Error analysis of Approximation Set Coding

To determine the optimal approximation precision for an
optimization problem $R(\cdot, X)$ we have to determine necessary
and sufficient conditions which have to hold in order to
reliably identify approximation sets. Reliable identification of
approximation sets enables us to define a communication
protocol using the above described coding scheme. Therefore, we
analyse the error probability of approximation set coding and
the channel capacity which is associated with a particular cost
function $R(\cdot, X)$. This channel capacity will be referred to as
approximation capacity since it determines the approximation
precision of the coding scheme.

A communication error occurs if the sender selects $\sigma_s$ and
the receiver decodes $\hat{\sigma} = \sigma_j, j \ne s$. To estimate the probability
of this event, we introduce the sets

$\Delta\mathcal{C}_j := (\psi \circ \mathcal{C}_\gamma(\sigma_j \circ X^{(1)})) \cap \mathcal{C}_\gamma(X^{(2)}), \quad \sigma_j \in \Sigma$. (6)



The set $\Delta\mathcal{C}_j$ measures the intersection between the
approximation set $\mathcal{C}_\gamma(\sigma_j \circ X^{(1)})$ for $\sigma_j$-permuted measurements and the
approximation set which has been calculated by the receiver
based on the test data $\tilde{X}$.
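The decoding step can be sketched end to end for a tiny problem. The code below is an illustration under several assumptions that are not taken from the paper: a k-means style cost, exhaustive enumeration of assignments, paired training and test objects so that $\psi$ is the identity, the convention $(\sigma \circ X)_i = X_{\sigma(i)}$, and a small hand-picked codebook instead of $2^{nR}$ random permutations.

```python
import itertools
import numpy as np

def cost(labels, X, k):
    """k-means style cost: squared distances to cluster means."""
    labels = np.asarray(labels)
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts) > 0:
            total += float(((pts - pts.mean(axis=0)) ** 2).sum())
    return total

def approx_set(X, k, gamma):
    """All assignments whose cost is within gamma of the minimum."""
    costs = {c: cost(c, X, k)
             for c in itertools.product(range(k), repeat=len(X))}
    r_min = min(costs.values())
    return {c for c, r in costs.items() if r <= r_min + gamma + 1e-12}

def decode(X_tilde, X1, codebook, k, gamma):
    """Pick the codebook permutation whose approximation set on the permuted
    training data overlaps most with the receiver's approximation set."""
    received = approx_set(X_tilde, k, gamma)
    overlaps = [len(approx_set(X1[sigma], k, gamma) & received)
                for sigma in codebook]
    return int(np.argmax(overlaps)), overlaps

# Hypothetical paired samples: X2 is a slightly perturbed copy of X1.
X1 = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
X2 = np.array([[0.03], [0.18], [0.41], [0.97], [1.22], [1.39]])

# Hand-picked permutations that induce distinct optimal partitions.
codebook = [[0, 1, 2, 3, 4, 5], [0, 1, 3, 2, 4, 5],
            [0, 4, 2, 3, 1, 5], [5, 1, 2, 3, 4, 0]]

sent = 2
X_tilde = X2[codebook[sent]]          # problem generator applies sigma_s
decoded, overlaps = decode(X_tilde, X1, codebook, k=2, gamma=0.0)
```

With low noise the overlap is concentrated on the permutation that was sent. Permutations that map every cluster onto itself would produce identical approximation sets and could not be told apart, which is why the example codebook is chosen by hand.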

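The probability of a communication error is given by a
substantial overlap $\Delta\mathcal{C}_j$ with $\sigma_j \in \Sigma \setminus \{\sigma_s\}$, i.e.,

$P(\hat{\sigma} \ne \sigma_s \mid \sigma_s) = P\big(\max_{\sigma_j \in \Sigma \setminus \{\sigma_s\}} |\Delta\mathcal{C}_j| \ge |\Delta\mathcal{C}_s| \,\big|\, \sigma_s\big) \le \sum_{\sigma_j \in \Sigma \setminus \{\sigma_s\}} P(|\Delta\mathcal{C}_j| \ge |\Delta\mathcal{C}_s| \mid \sigma_s)$. (7)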

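The notation $X^{(1,2)} = (X^{(1)}, X^{(2)})$ and $\mathbb{I}_{\{x\}} = 1$ if $x$ is true, $0$ otherwise,
is used. The inequality in (7) is caused by the union bound.
The confusion probability with message $\sigma_j, j \ne s$ for given
training data $X^{(1)}$ and test data $X^{(2)}$ conditioned on $\sigma_s$ is
defined by

$P(|\Delta\mathcal{C}_j| \ge |\Delta\mathcal{C}_s| \mid \sigma_s) = \mathbb{E}_{\sigma_j}\big[\mathbb{I}_{\{|\Delta\mathcal{C}_j| \ge |\Delta\mathcal{C}_s|\}}\big]$

$= \frac{1}{|\{\sigma_j\}|} \sum_{\{\sigma_j\}} \mathbb{I}_{\{\log|\Delta\mathcal{C}_j| \ge \log|\Delta\mathcal{C}_s|\}}$

$\overset{(a)}{\le} \frac{1}{|\{\sigma_j\}|} \sum_{\{\sigma_j\}} \exp(\log|\Delta\mathcal{C}_j| - \log|\Delta\mathcal{C}_s|) = \frac{1}{|\{\sigma_j\}|} \frac{1}{|\Delta\mathcal{C}_s|} \sum_{\{\sigma_j\}} |\Delta\mathcal{C}_j|$

$\overset{(b)}{=} \frac{|\mathcal{C}_\gamma(X^{(1)})| \, |\mathcal{C}_\gamma(X^{(2)})|}{|\{\sigma_j\}| \, |\Delta\mathcal{C}_s|}$

$\overset{(c)}{=} \exp(-n I_\gamma(\sigma_j, \hat{\sigma}))$ (8)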



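The expectation $\mathbb{E}_{\sigma_j}[\mathbb{I}_{\{|\Delta\mathcal{C}_j| \ge |\Delta\mathcal{C}_s|\}}]$ in derivation (8) is
conditioned on $\sigma_s$ which has been omitted to increase the readability
of the formulas. The summation $\sum_{\{\sigma_j\}}$ is indexed by all possible
realizations of the transformation $\sigma_j$ that are uniformly
selected. (a) We have used the inequality $\mathbb{I}_{\{x \ge 0\}} \le \exp(x)$;
(b) averaging over a random permutation $\sigma_j$ of object indices
breaks any statistical dependence between sender and receiver
approximation sets which corresponds to the error case in
jointly typical coding [5]; (c) we have introduced the mutual
information between the uniform distribution of the sender
message $\sigma_j$ and the receiver message $\hat{\sigma}$

$I_\gamma(\sigma_j, \hat{\sigma}) = \frac{1}{n} \log \frac{|\{\sigma_j\}| \, |\Delta\mathcal{C}_s|}{|\mathcal{C}_\gamma^{(1)}| \, |\mathcal{C}_\gamma^{(2)}|} = \frac{1}{n} \Big( \log \frac{|\{\sigma_j\}|}{|\mathcal{C}_\gamma^{(1)}|} + \log \frac{|\mathcal{C}^{(2)}|}{|\mathcal{C}_\gamma^{(2)}|} - \log \frac{|\mathcal{C}^{(2)}|}{|\Delta\mathcal{C}_s|} \Big)$. (9)

To compactify the formula, the following notation is
introduced: $\mathcal{C}^{(i)} := \mathcal{C}(X^{(i)}), \mathcal{C}_\gamma^{(i)} := \mathcal{C}_\gamma(X^{(i)}), i = 1, 2$. The
interpretation of eq. (9) is straightforward: The first logarithm
measures the entropy of the number of transformations which
can be resolved with an uncertainty of $\mathcal{C}_\gamma^{(1)}$ in the space of
clusterings on the sender side. The logarithm $\log(|\mathcal{C}^{(2)}| / |\mathcal{C}_\gamma^{(2)}|)$
calculates the entropy of the receiver clusterings which are
quantized by $\mathcal{C}_\gamma^{(2)}$. The third logarithm measures the joint
entropy of $(\sigma_j, \hat{\sigma})$ which depends on the size of the intersection
$|\Delta\mathcal{C}_s| = |(\psi \circ \mathcal{C}_\gamma(\sigma_s \circ X^{(1)})) \cap \mathcal{C}_\gamma(\sigma_s \circ X^{(2)})|$.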

Inserting (8) into (7) yields the upper bound for the error
probability

$P(\hat{\sigma} \ne \sigma_s \mid \sigma_s) \le \exp(nR \log 2) \exp(-n I_\gamma(\sigma_j, \hat{\sigma}))$

$= \exp(-n(I_\gamma(\sigma_j, \hat{\sigma}) - R \log 2))$ (10)

The communication rate $nR \log 2$ is limited by the mutual
information $I_\gamma(\sigma_j, \hat{\sigma})$ for asymptotically error-free
communication.
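As a numerical illustration of the bound (10), the sketch below evaluates $\exp(-n(I_\gamma - R \log 2))$ for given values. It assumes that the mutual information is measured in nats and the rate $R$ in bits per object, as suggested by the factor $\log 2$; the function names and the sample values are hypothetical.

```python
import math

def error_bound(n, mutual_info_nats, rate_bits):
    """Right-hand side of (10); it is only informative when below 1."""
    return math.exp(-n * (mutual_info_nats - rate_bits * math.log(2)))

def max_rate_bits(mutual_info_nats):
    """Largest rate (bits per object) for which the exponent stays negative."""
    return mutual_info_nats / math.log(2)

# Hypothetical values: 100 objects, rate of 0.5 bits per object.
bound_ok = error_bound(100, 0.5, 0.5)     # I_gamma above R log 2
bound_bad = error_bound(100, 0.3, 0.5)    # I_gamma below R log 2
```

The bound decays exponentially in $n$ whenever the rate in nats, $R \log 2$, stays below $I_\gamma$, and becomes vacuous (larger than 1) otherwise.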

VI. Information theoretical model selection 

The analysis of the error probability suggests the following
inference principle for model selection: the approximation
precision is controlled by $\gamma$ which has to be minimized to
derive more expressive clusterings. For large $\gamma$ the rate $R$ will
be low since we resolve the space of clusterings in only a
coarse grained fashion. For too small $\gamma$ the error probability
does not vanish which indicates confusions between $\sigma_j$ and $\sigma_s$.
The optimal $\gamma$-value is given by the smallest $\gamma$ or, equivalently,
the highest approximation precision

$\gamma^* = \arg\max_{\gamma \in [0, \infty)} I_\gamma(\sigma, \hat{\sigma})$. (11)

Another choice to be made in modeling is to select a suitable
cost function for clustering $R(\cdot, X)$. Let us assume that a
number of cost functions $\{R_1(\cdot, X), R_2(\cdot, X), \dots, R_m(\cdot, X)\}$
are considered as candidates. The cost function to be selected
is

$R^*(c, X) = \arg\max_{1 \le j \le m} I_\gamma(\sigma(R_j), \hat{\sigma}(R_j))$ (12)

where both the random variables $\sigma$ and $\hat{\sigma}$ depend on $R(c, X)$.
The selection rule (12) prefers the model which is "expressive"
enough to exhibit high information content (e.g., many
clusters) and, at the same time, robustly resists noise in
the data set. The bits or nats which are measured in the ASC
communication setting are context sensitive since they refer to
a hypothesis class $\mathcal{C}(X)$, i.e., how finely or coarsely functions
can be resolved in $\mathcal{C}$.
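The selection of $\gamma$ can be sketched by brute force for a toy problem. The code below is illustrative only and rests on assumptions not stated in the paper: a k-means style cost, exhaustive enumeration of assignments, paired training and test objects so that $\psi$ is the identity, the score taken as the first form of eq. (9) with the identity permutation as sent message, and $\log|\{\sigma_j\}|$ approximated by $n$ times the entropy of the cluster proportions of the training ERM solution (the estimate given in Section VII). All names and data are hypothetical.

```python
import itertools
import math
import numpy as np

def cost(labels, X, k):
    """k-means style cost: squared distances to cluster means."""
    labels = np.asarray(labels)
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts) > 0:
            total += float(((pts - pts.mean(axis=0)) ** 2).sum())
    return total

def approx_set(X, k, gamma):
    """All assignments whose cost is within gamma of the minimum."""
    costs = {c: cost(c, X, k)
             for c in itertools.product(range(k), repeat=len(X))}
    r_min = min(costs.values())
    return {c for c, r in costs.items() if r <= r_min + gamma + 1e-12}

def type_entropy(labels, k):
    """Entropy (nats) of the cluster proportions of one assignment."""
    p = np.bincount(labels, minlength=k) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_information(X1, X2, k, gamma):
    """(1/n) log(|{sigma}| |overlap| / (|C_gamma(X1)| |C_gamma(X2)|))."""
    n = len(X1)
    c1, c2 = approx_set(X1, k, gamma), approx_set(X2, k, gamma)
    overlap = len(c1 & c2)
    if overlap == 0:
        return -math.inf
    erm = min(itertools.product(range(k), repeat=n),
              key=lambda c: cost(c, X1, k))
    log_n_sigma = n * type_entropy(np.array(erm), k)
    return (log_n_sigma + math.log(overlap)
            - math.log(len(c1)) - math.log(len(c2))) / n

def select_gamma(X1, X2, k, grid):
    """Grid value of gamma with the largest mutual information estimate."""
    return max(grid, key=lambda g: mutual_information(X1, X2, k, g))

X1 = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
X2 = np.array([[0.0], [0.2], [0.69], [1.0], [1.2], [1.4]])
grid = [0.0, 0.05, 0.5, 2.0]
best_gamma = select_gamma(X1, X2, k=2, grid=grid)
```

In this toy example the score is minus infinity at $\gamma = 0$ because the two ERM solutions disagree, peaks at a small positive $\gamma$, and drops to zero once the approximation sets cover the whole hypothesis class. The same score could be compared across several cost functions by passing a different `cost`, in the spirit of rule (12).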



VII. Computation of the approximation capacity 

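To estimate the mutual information $I_\gamma(\sigma, \hat{\sigma})$
computationally, we have to calculate the size of the sets
$|\mathcal{C}_\gamma(X^{(1)})|, |\mathcal{C}_\gamma(X^{(2)})|, |\{\sigma_j\}|, |\Delta\mathcal{C}_s|$.

The cardinality $|\{\sigma_j\}|$ is determined by the type of the
empirical minimizer $c^\perp(X)$, i.e., the probabilities $p_\ell :=
P(c^\perp(X^{(1)}) = \ell), 1 \le \ell \le k$ with

$|\{\sigma_j\}| \doteq \exp(n H(p_1, \dots, p_k))$ (13)

where $H(p_1, \dots, p_k) \equiv -\sum_{\ell=1}^{k} p_\ell \log p_\ell$ denotes the entropy
of the type of $c^\perp(X^{(1)})$, ($a_n \doteq b_n \Leftrightarrow \lim_{n \to \infty} \frac{1}{n} \log \frac{a_n}{b_n} = 0$).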

The cardinality of the approximation sets can be estimated
using concepts from statistical physics. The
approximation sets $\mathcal{C}_\gamma(X^{(1,2)})$ are known as microcanonical
ensembles in statistical mechanics. Estimating their size is
achieved up to logarithmic corrections by calculating the
partition function

$|\mathcal{C}_\gamma(X^{(1,2)})| \doteq \sum_{c \in \mathcal{C}(X^{(1,2)})} \exp(-\beta R(c, X^{(1,2)}))$. (14)

The scaling factor $\beta$, also known as inverse computational
temperature, is determined such that the average costs of
the ensemble $\mathcal{C}_\gamma(X^{(1)})$ yield $R(c^\perp, X^{(1)}) + \gamma$. The weights
$\exp(-\beta R(c, X^{(1,2)}))$ are known as Boltzmann factors.
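The calibration of $\beta$ can be sketched numerically for a problem small enough to enumerate. The code below assumes a k-means style cost and uses a bisection on a logarithmic scale to find the $\beta$ at which the Boltzmann-weighted average cost equals the minimal cost plus $\gamma$; it requires that this target lies between the minimal cost and the unweighted average cost. The function names, search bounds, and data are hypothetical choices for illustration.

```python
import itertools
import math
import numpy as np

def all_costs(X, k):
    """k-means style costs of all k**n assignments (small n only)."""
    out = []
    for c in itertools.product(range(k), repeat=len(X)):
        labels = np.array(c)
        total = 0.0
        for j in range(k):
            pts = X[labels == j]
            if len(pts) > 0:
                total += float(((pts - pts.mean(axis=0)) ** 2).sum())
        out.append(total)
    return np.array(out)

def gibbs_mean_cost(costs, beta):
    """Average cost under weights exp(-beta * cost)."""
    w = np.exp(-beta * (costs - costs.min()))  # shifted for stability
    return float((w * costs).sum() / w.sum())

def log_partition(costs, beta):
    """log of sum_c exp(-beta * R(c, X)), computed stably."""
    shifted = -beta * (costs - costs.min())
    return float(-beta * costs.min() + np.log(np.exp(shifted).sum()))

def calibrate_beta(costs, gamma, lo=1e-6, hi=1e4, iters=200):
    """Find beta with mean cost = min cost + gamma; the mean cost
    decreases monotonically in beta, so bisection applies."""
    target = costs.min() + gamma
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if gibbs_mean_cost(costs, mid) > target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

X = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
costs = all_costs(X, k=2)
beta = calibrate_beta(costs, gamma=0.2)
log_z = log_partition(costs, beta)
```

For larger problems the enumeration would have to be replaced by sampling or by analytical approximations, as discussed at the end of this section.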

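The joint entropy in the mutual information, which is related
to the intersection

$|\Delta\mathcal{C}| = |(\psi \circ \mathcal{C}_\gamma(X^{(1)})) \cap \mathcal{C}_\gamma(X^{(2)})|$

$= \sum_{c \in \mathcal{C}(X^{(2)})} \mathbb{I}_{\{c \in \psi \circ \mathcal{C}_\gamma(X^{(1)})\}} \mathbb{I}_{\{c \in \mathcal{C}_\gamma(X^{(2)})\}}$

$\doteq \sum_{c \in \mathcal{C}(X^{(2)})} \exp(-\beta R(\psi^{-1} \circ c, X^{(1)})) \cdot \exp(-\beta R(c, X^{(2)}))$, (15)

involves a product of Boltzmann factors.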

The identification of approximation sets with microcanonical
ensembles provides access to a rich source of computational
and analytical methods from statistical physics to
calculate the mutual information $I_\gamma(\sigma, \hat{\sigma})$. This analogy is by
no means accidental since information theory and statistical
mechanics are both specializations of empirical process theory
with large deviation analysis of many particle systems. The
central role of entropy and free energy is reflected in ASC
coding where the logarithm of the partition function arises in
the mutual information (9) twice.

The cardinalities of the approximation sets can also be
numerically estimated by sampling using Markov Chain Monte
Carlo methods or by employing analytical techniques like
deterministic annealing [9], [3].

VIII. Why information theory for clustering validation?

There exists a long history of information theoretic
approaches to model selection, which traces back at least to
Akaike's extension of the Maximum Likelihood principle. AIC
penalizes fitted models by twice the number of free parameters.
The Bayesian Information Criterion (BIC) suggests a
stronger penalty than AIC, i.e., number of model parameters
times logarithm of the number of samples. Rissanen's
minimum description length principle is closely related to
BIC (see e.g. Q for model selection penalties). Tishby et al 
[10] proposed to select the number of clusters according to a
difference of mutual informations which is closely related to 
rate distortion theory with side information. 

All these information criteria regularize model estimation of 
the data source. Approximation set coding pursues a different 
strategy for the following reason: Quite often the measurement
space $\mathcal{X}$ has a much higher "dimension" than the solution
space. Consider for example the problem of spectral clustering
with $k$ groups based on dissimilarities $D$: The measurements
are elements of $\mathbb{R}^{n(n-1)/2}$ for real valued, symmetric weights
with vanishing self-dissimilarities, but we can at most
distinguish $O(k^n)$ different clusterings. Any approach which relies
on estimating the probability distribution $P(X)$ of the data
ultimately will fail since it requires far more observations
than are needed to identify one hypothesis or a set of hypotheses,
i.e., one clustering or a set of clusterings.

Using an information theoretic perspective, we might ask 
the question how the uncertainty in the measurements reduces 
the resolution in the hypothesis class. How similar can two 
hypotheses be so that they are still statistically distinguishable 
given a cost function $R(c, X)$? This research program is based
on the idea that approximation sets of clustering cost functions 
can be used as a reliable code. The capacity of such a coding 
scheme then answers the question how sensitive a particular 
cost function is to data noise. 

IX. Conclusion 

Model selection and validation require estimating the
generalization ability of models from training to test data. "Good"
models show a high expressiveness and they are robust w.r.t.
noise in the data. This tradeoff between informativeness and 
robustness ranks different models when they are tested on new 
data and it quantitatively describes the underfitting/overfitting 
dilemma. In this paper we have explored the idea of using
approximation sets of clustering solutions as a communication
code. Since clustering solutions with k clusters can be
represented as strings of n symbols with a k-ary alphabet,
the significant problem of model order selection in clustering 
can be naturally phrased as a communication problem. The 
approximation capacity of a cost function provides a selection 
criterion which renders various models comparable in terms 
of their respective bit rates. The number of reliably extractable
bits of a clustering cost function $R(\cdot, X)$ defines a "task
sensitive information measure" since it only accounts for the 
fluctuations in the data X which actually have an influence 
on identifying an individual clustering solution or a set of 
clustering solutions. 

The maximum entropy inference principle suggests that we
should average over the statistically indistinguishable solutions
in the optimal approximation set $\mathcal{C}_{\gamma^*}(X)$. Such a model
averaging strategy replaces the original cost function with the
free energy and, thereby, it defines a continuation method
with maximal robustness. The urgent question in many data
analysis applications, which regularization term should be used
without introducing an unwanted bias, is naturally answered
by the entropy. The second question, how the regularization
parameter should be selected, is answered by ASC: Choose the
parameter value which maximizes the approximation capacity!

ASC for model selection can be applied to all combina- 
torial or continuous optimization problems which depend on 
noisy data. The noise level is characterized by two samples
$X^{(1)}, X^{(2)}$. Two samples provide by far too little information
to estimate the probability density of the measurements but 
two large samples contain sufficient information to determine 
the uncertainty in the solution space. The equivalence of 
ensemble averages and time averages of ergodic systems is 
heavily exploited in statistical mechanics and it also enables 
us in this paper to derive a model selection strategy based on 
two samples. 

Future work also includes the study of algorithmic
complexity issues. The question of how hard properly regularized
optimization problems are hints at a relationship between
computational complexity and statistical complexity.